Real estate is one of the best types of investment because it is stable. The benefits of investing in real estate are numerous. With well-chosen assets, investors can enjoy predictable cash flow, excellent returns, tax advantages, and diversification, and it is possible to leverage real estate to build wealth. Real estate investors make money through rental income, profits generated by property-dependent business activity, and appreciation. Real estate values tend to increase over time, and with a good investment, you can turn a profit when it is time to sell. Rents also tend to rise over time, which can lead to higher cash flow. This chart from the Federal Reserve Bank of St. Louis shows average home prices in the U.S. since 1963; the areas shaded in grey indicate U.S. recessions.
Our main goal is to develop an algorithm that best predicts house prices, allowing the real estate company to decide on the best prices to set in the market, bringing agility backed by a robust system.
# update plotly and install sweetviz
!pip install --upgrade plotly
!pip install sweetviz
# import numpy, matplotlib, etc.
import math
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
# sklearn imports
from sklearn import metrics
from sklearn import pipeline
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import neural_network
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
train_df = pd.read_csv("/content/train.csv")
test_df = pd.read_csv("/content/test.csv")
test_id = test_df["Id"]
display(train_df)
Now we will look at all our features: each column is a feature, and the info below also shows each column's type.
train_df.info()
import sweetviz as sw
house_prices_report = sw.analyze(train_df)
house_prices_report.show_notebook(layout='vertical')
#correlation matrix
corrmat = train_df.corr()  # on pandas >= 2.0, pass numeric_only=True
f, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(corrmat, vmax=.8, square=True);
As you can see, the full map is hard to read, so we will reduce the heatmap to the 10 features most correlated with our target (SalePrice).
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
f, ax = plt.subplots(figsize=(12, 9))
cm = np.corrcoef(train_df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
Now we will check for outliers and remove them.
var = 'OverallQual'
data = pd.concat([train_df['SalePrice'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
We can see that there are no significant outliers between these features.
var = 'GrLivArea'
data = pd.concat([train_df['SalePrice'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
We can see that there are two outliers with a low price and a large GrLivArea; let's drop them.
train_df.sort_values(by = 'GrLivArea', ascending = False)[:2]
train_df = train_df.drop(train_df[train_df['Id'] == 1299].index)
train_df = train_df.drop(train_df[train_df['Id'] == 524].index)
train_df.reset_index(drop=True,inplace=True)
TotalBsmtSF: Total square feet of basement area
var = 'TotalBsmtSF'
data = pd.concat([train_df['SalePrice'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
We can see that there are no significant outliers between these features.
GarageCars: Size of garage in car capacity
var = 'GarageCars'
data = pd.concat([train_df['SalePrice'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
We can see that there are no significant outliers between these features.
In the Sweetviz report we can see that 'Utilities' is 'AllPub' for every house except one, 'Condition2' is 'Norm' for 99% of houses, and 'RoofMatl' is 'CompShg' for 98%, so we'll drop those features.
The heatmap shows us that 'GarageArea' and 'GarageCars' are highly correlated, as are '1stFlrSF' and 'TotalBsmtSF', and 'GarageYrBlt' and 'YearBuilt'. We decided to drop one feature of each pair (keeping the one with the higher correlation with the target). Furthermore, 'GarageFinish' and 'GarageCond' are dropped because they give information similar to 'GarageQual'.
train_df = train_df.drop(columns=['GarageYrBlt','GarageArea','1stFlrSF','GarageFinish','GarageCond','Utilities','Condition2','RoofMatl'])
test_df = test_df.drop(columns=['GarageYrBlt','GarageArea','1stFlrSF','GarageFinish','GarageCond','Utilities','Condition2','RoofMatl'])
total = train_df.isnull().sum().sort_values(ascending=False)
percent = (train_df.isnull().sum()/train_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
Now we will drop the 'Id' column (it is just an index) and all columns with more than 20% missing values.
train_df = train_df.drop(columns=['Id','Alley','PoolQC','Fence','MiscFeature','FireplaceQu'])
test_df = test_df.drop(columns=['Id','Alley','PoolQC','Fence','MiscFeature','FireplaceQu'])
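The hardcoded column list above can also be derived programmatically from the missing-value percentages. A minimal sketch on a toy DataFrame (the values here are made up for illustration, not taken from the actual dataset):

```python
import numpy as np
import pandas as pd

# toy frame standing in for train_df (hypothetical missing-value patterns)
df = pd.DataFrame({
    "Alley":   [np.nan] * 9 + ["Grvl"],      # 90% missing
    "LotArea": [8450] * 10,                  # complete
    "Fence":   [np.nan] * 3 + ["MnPrv"] * 7, # 30% missing
})

missing_frac = df.isnull().mean()                      # fraction of NaNs per column
to_drop = missing_frac[missing_frac > 0.2].index.tolist()
print(to_drop)  # → ['Alley', 'Fence']
```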
def get_cmap(n, name='plasma'):
    return plt.cm.get_cmap(name, n)
# plot target values by each feature
def plot_target_values_by_each_feature(df, target_column_name):
    nrows = math.ceil(math.sqrt(len(df.columns) - 1))
    ncols = math.ceil((len(df.columns) - 1) / nrows)
    plt.style.use('seaborn')  # 'seaborn-v0_8' on matplotlib >= 3.8
    fig, axes = plt.subplots(nrows, ncols)
    plt.subplots_adjust(top=3, bottom=0, left=0, right=2.5)
    colors = get_cmap(len(df.columns))
    for i in range(len(df.columns) - 1):
        ax = axes[i // ncols, i % ncols]  # index by (row, col) within the grid
        df.plot(kind='scatter', x=df.columns[i], y=target_column_name, ax=ax, color=colors(i))
        ax.tick_params(axis='both', labelsize=10)
        ax.xaxis.label.set_size(10)
        ax.yaxis.label.set_size(10)
        ax.title.set_fontsize(10)
    # remove the unused axes at the end of the grid
    for i in range(len(df.columns) - 1, nrows * ncols):
        fig.delaxes(axes.flatten()[i])
numerical_cols = train_df.select_dtypes(include=['int64', 'float64']).columns
df_numerical = train_df[numerical_cols]
plot_target_values_by_each_feature(df_numerical, 'SalePrice')
'YearBuilt' and 'YearRemodAdd' seem to affect the target in a similar way, so we will combine them into a single "year of last construction" feature.
for df in [train_df, test_df]:
    df["YearLstCnst"] = df[["YearBuilt", "YearRemodAdd"]].max(axis=1)
train_df = train_df.drop(columns = ["YearBuilt", "YearRemodAdd"])
test_df = test_df.drop(columns = ["YearBuilt", "YearRemodAdd"])
display(train_df["YearLstCnst"])
None of these features has a good correlation with the price, so we will drop them.
train_df = train_df.drop(columns=['EnclosedPorch', '3SsnPorch','ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'])
test_df = test_df.drop(columns=['EnclosedPorch', '3SsnPorch','ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'])
It seems that the number of kitchens above ground does not affect the price; in addition, most houses have only one. We will drop that column.
train_df = train_df.drop(columns=['KitchenAbvGr'])
test_df = test_df.drop(columns=['KitchenAbvGr'])
total = train_df.isnull().sum().sort_values(ascending=False)
percent = (train_df.isnull().sum()/train_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
LotFrontage: Since the street frontage connected to a house most likely has a similar area to that of other houses in its neighborhood, we can fill in missing values with the median LotFrontage of the neighborhood.
train_df["LotFrontage"] = train_df.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
test_df["LotFrontage"] = test_df.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2 :
For all these categorical basement-related features, NaN means that there is no basement
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    train_df[col] = train_df[col].fillna('None')
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    test_df[col] = test_df[col].fillna('None')
MasVnrArea and MasVnrType :
NA most likely means no masonry veneer for these houses.
We can fill 0 for the area and None for the type.
train_df["MasVnrType"] = train_df["MasVnrType"].fillna("None")
train_df["MasVnrArea"] = train_df["MasVnrArea"].fillna(0)
test_df["MasVnrType"] = test_df["MasVnrType"].fillna("None")
test_df["MasVnrArea"] = test_df["MasVnrArea"].fillna(0)
GarageType and GarageQual : Replacing missing data with None
for col in ('GarageType', 'GarageQual'):
    train_df[col] = train_df[col].fillna('None')
for col in ('GarageType', 'GarageQual'):
    test_df[col] = test_df[col].fillna('None')
'Electrical' has only 1 missing value, so we fill it with the most frequent value.
train_df['Electrical'] = train_df['Electrical'].fillna(train_df['Electrical'].mode()[0])
test_df['Electrical'] = test_df['Electrical'].fillna(test_df['Electrical'].mode()[0])
Transforming some numerical variables that are really categorical
train_df['MSSubClass'] = train_df['MSSubClass'].apply(str)
test_df['MSSubClass'] = test_df['MSSubClass'].apply(str)
test_df.info()
total = test_df.isnull().sum().sort_values(ascending=False)
percent = (test_df.isnull().sum()/test_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
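This missing-value summary block appears several times in the notebook; it could be factored into a small helper. A sketch (the function name is our own, and `isnull().mean()` is equivalent to the original `isnull().sum()/isnull().count()`):

```python
import numpy as np
import pandas as pd

def missing_summary(df, n=20):
    """Total and fraction of missing values per column, sorted descending."""
    total = df.isnull().sum().sort_values(ascending=False)
    percent = df.isnull().mean().sort_values(ascending=False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent']).head(n)

# tiny demo frame (hypothetical)
demo = pd.DataFrame({"a": [1, np.nan], "b": [1, 2]})
print(missing_summary(demo))
```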
test_df['MSZoning'].describe()
We can see that 77% of the values in this feature are 'RL', so we will fill the NAs with 'RL'.
test_df['MSZoning'] = test_df['MSZoning'].fillna('RL')
Missing values in these features mean there is no basement, so we fill them with 0.
test_df[['BsmtHalfBath','BsmtFullBath','TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1']].describe()
for col in ('BsmtHalfBath','BsmtFullBath','TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1'):
    test_df[col] = test_df[col].fillna(0)
test_df['Functional'].describe()
93% of the values are 'Typ'.
test_df['Functional'] = test_df['Functional'].fillna('Typ')
test_df['SaleType'].describe()
test_df["SaleType"] = test_df["SaleType"].fillna("WD")
test_df['GarageCars'].describe()
test_df["GarageCars"] = test_df["GarageCars"].fillna(test_df["GarageCars"].mean())
test_df[['Exterior1st', 'Exterior2nd']].describe()
test_df["Exterior1st"] = test_df["Exterior1st"].fillna("VinylSd")
test_df["Exterior2nd"] = test_df["Exterior2nd"].fillna("VinylSd")
test_df['KitchenQual'].describe()
test_df["KitchenQual"] = test_df["KitchenQual"].fillna("TA")
Checking that the data is ready:
test_df.isna().any()
from tqdm.auto import tqdm
def find_generator_len(generator, use_pbar=True):
    i = 0
    if use_pbar:
        pbar = tqdm(desc='Calculating Length', ncols=1000, bar_format='{desc}{bar:10}{r_bar}')
    for a in generator:
        i += 1
        if use_pbar:
            pbar.update()
    if use_pbar:
        pbar.close()
    return i
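For the KFold splitter used below, the number of splits is actually known up front, so the generator does not need to be exhausted. A simpler alternative sketch:

```python
from sklearn.model_selection import KFold

cv = KFold(n_splits=10, shuffle=True, random_state=1)
n_folds = cv.get_n_splits()  # known without consuming the split generator
print(n_folds)  # → 10
```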
First, let's split the data into features and target.
t = train_df['SalePrice'].copy()
X = train_df.drop(['SalePrice'], axis=1).copy()
print('t')
display(t)
print()
print('X')
display(X)
# calculate score and loss from cv (KFold) and display graphs
from sklearn.model_selection import KFold
def get_cv_score_and_loss(X, t, model, k, show_score_loss_graphs=False, use_pbar=True):
    scores_losses_df = pd.DataFrame(columns=['fold_id', 'split', 'score', 'loss'])
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    if use_pbar:
        pbar = tqdm(desc='Computing Models', total=find_generator_len(cv.split(X)))
    for i, (train_ids, val_ids) in enumerate(cv.split(X)):
        X_train = X.loc[train_ids]
        t_train = t.loc[train_ids]
        X_val = X.loc[val_ids]
        t_val = t.loc[val_ids]
        model.fit(X_train, t_train)
        y_train = model.predict(X_train)
        y_val = model.predict(X_val)
        scores_losses_df.loc[len(scores_losses_df)] = [i, 'train', model.score(X_train, t_train), mean_squared_error(t_train, y_train, squared=False)]
        scores_losses_df.loc[len(scores_losses_df)] = [i, 'val', model.score(X_val, t_val), mean_squared_error(t_val, y_val, squared=False)]
        if use_pbar:
            pbar.update()
    if use_pbar:
        pbar.close()
    val_scores_losses_df = scores_losses_df[scores_losses_df['split'] == 'val']
    train_scores_losses_df = scores_losses_df[scores_losses_df['split'] == 'train']
    mean_val_score = val_scores_losses_df['score'].mean()
    mean_val_loss = val_scores_losses_df['loss'].mean()
    mean_train_score = train_scores_losses_df['score'].mean()
    mean_train_loss = train_scores_losses_df['loss'].mean()
    if show_score_loss_graphs:
        fig = px.line(scores_losses_df, x='fold_id', y='score', color='split', title=f'Mean Val Score: {mean_val_score:.2f}, Mean Train Score: {mean_train_score:.2f}')
        fig.show()
        fig = px.line(scores_losses_df, x='fold_id', y='loss', color='split', title=f'Mean Val Loss: {mean_val_loss:.2f}, Mean Train Loss: {mean_train_loss:.2f}')
        fig.show()
    return mean_val_score, mean_val_loss, mean_train_score, mean_train_loss
We will encode the categorical features with one-hot encoding (OHE) and apply standard scaling to the numerical features.
from sklearn.compose import ColumnTransformer
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
ct = ColumnTransformer([
("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
("standard", StandardScaler(), numerical_cols)])
model_pipe = make_pipeline(ct, SGDRegressor(random_state=1))
# model = SGDRegressor(random_state=1)
val_score, val_loss, train_score, train_loss = get_cv_score_and_loss(X, t, model_pipe, k=10, show_score_loss_graphs=True)
print(f'mean cv val score: {val_score:.2f}\nmean cv val loss {val_loss:.2f}')
print(f'mean cv train score: {train_score:.2f}\nmean cv train loss {train_loss:.2f}')
We will use scikit-learn's RFECV, which performs backward feature selection (recursive feature elimination) with cross-validation.
The default CV is 5-fold cross-validation.
We pass scikit-learn's RepeatedKFold to repeat each K-fold split a few times with different shuffles.
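RepeatedKFold simply yields `n_splits * n_repeats` train/validation splits, each repeat using a different shuffle. A small sketch:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples
rkf = RepeatedKFold(n_splits=5, n_repeats=5, random_state=1)
splits = list(rkf.split(X))
print(len(splits))  # → 25 (5 folds * 5 repeats)
```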
# find best subset of features on this dataset
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedKFold
df = X.copy()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct = ColumnTransformer([
("encoding", OrdinalEncoder(), categorical_cols),
("standard", StandardScaler(), numerical_cols)])
X_encoded = pd.DataFrame(ct.fit_transform(X, t),columns=all_cols)
# model_pipe = make_pipeline(ct, SGDRegressor(random_state=1))
selector = RFECV(SGDRegressor(random_state=1), cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=1)).fit(X_encoded, t)
display(X_encoded.loc[:, selector.support_])
best_features = selector.support_
fig = go.Figure()
# note: in scikit-learn >= 1.2, grid_scores_ was removed; use selector.cv_results_['mean_test_score'] instead
fig.add_trace(go.Scatter(x=[i for i in range(1, len(selector.grid_scores_) + 1)], y=selector.grid_scores_))
fig.update_xaxes(title_text="Number of features selected")
fig.update_yaxes(title_text="Cross-validation score (R²)")
fig.show()
print(X.loc[:, best_features].keys())
print("Number of features: {}".format(len(X.loc[:, best_features].keys())))
X_best = X.loc[:,best_features]
X_best
# show graphs of score and loss by polynomial degree of the numerical features
def show_degree_graphs_cv_train(X, t, model, k, max_degree=10):
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
    categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
    val_train_score_loss_df = pd.DataFrame(columns=['degree', 'split', 'score', 'loss'])
    for i in tqdm(range(1, max_degree), desc='Poly Degree'):
        ct_enc_std_poly = ColumnTransformer([
            ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
            ("standard_poly", make_pipeline(PolynomialFeatures(degree=i), StandardScaler()), numerical_cols)])
        model_pipe = make_pipeline(ct_enc_std_poly, model)
        val_score, val_loss, train_score, train_loss = get_cv_score_and_loss(X, t, model_pipe, k=k, show_score_loss_graphs=False, use_pbar=False)
        val_train_score_loss_df.loc[len(val_train_score_loss_df)] = [i, 'train', train_score, train_loss]
        val_train_score_loss_df.loc[len(val_train_score_loss_df)] = [i, 'cv', val_score, val_loss]
    fig = px.line(val_train_score_loss_df, x='degree', y='score', color='split')
    fig.show()
    fig = px.line(val_train_score_loss_df, x='degree', y='loss', color='split')
    fig.show()
    # pick the degree with the best CV score (not the train score, which grows with degree)
    cv_df = val_train_score_loss_df[val_train_score_loss_df['split'] == 'cv']
    max_val = cv_df["score"].max()
    best_degree = cv_df[cv_df["score"] == max_val]["degree"].to_numpy()[0]
    return best_degree
best_degree = show_degree_graphs_cv_train(X_best, t, SGDRegressor(random_state=1), k=10 ,max_degree=5)
In both graphs above (score and loss), the best result is at degree 2. The features we used are those chosen by the feature-selection function.
Earlier, we tried using all 57 features that remained after the data exploration, and that gave a lower score both on CV and on the final test.
Let's tune some hyper-parameters of the SGD.
def choose_best_lr(x, t):
    scores = pd.DataFrame(columns=["lr", "val_score", "val_loss", "train_score", "train_loss"])
    lr = 0.0001
    for i in range(1000):
        selector = SGDRegressor(random_state=1, eta0=lr, learning_rate="constant").fit(x, t)
        mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=10, show_score_loss_graphs=False, use_pbar=False)
        if mean_val_score < 0:
            break
        scores.loc[len(scores)] = [lr, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]
        lr += 0.0001
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=scores["lr"], y=scores["val_score"]))
    fig.update_xaxes(title_text="Learning Rate")
    fig.update_yaxes(title_text="Cross-validation score (R²)")
    fig.show()
    max_val = scores["val_score"].max()
    best_lr = scores[scores["val_score"] == max_val]["lr"].to_numpy()[0]
    return best_lr, max_val
X_ = X.loc[:,best_features]
numerical_cols = X_.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct = ColumnTransformer([
("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
("standard", StandardScaler(), numerical_cols)])
X_encoded = pd.DataFrame(ct.fit_transform(X_, t))
best_lr,best_score = choose_best_lr(X_encoded,t)
print("Best learning rate : {}".format(best_lr))
We will try the different regularization methods we studied in class:
def choose_regularization(x, t):
    scores = pd.DataFrame(columns=["penalty", "val_score", "val_loss", "train_score", "train_loss"])
    for penalty in ['l1', 'l2', 'elasticnet']:
        selector = SGDRegressor(penalty=penalty, random_state=1, eta0=best_lr, learning_rate="constant").fit(x, t)
        mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=10, show_score_loss_graphs=False, use_pbar=False)
        scores.loc[len(scores)] = [penalty, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=scores["penalty"], y=scores["val_score"]))
    fig.update_xaxes(title_text="penalty")
    fig.update_yaxes(title_text="Cross-validation score (R²)")
    fig.show()
    max_val = scores["val_score"].max()
    best_penalty = scores[scores["val_score"] == max_val]["penalty"].to_numpy()[0]
    return best_penalty, max_val
X_ = X.loc[:,best_features]
numerical_cols = X_.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct = ColumnTransformer([
("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
("standard", StandardScaler(), numerical_cols)])
X_encoded = pd.DataFrame(ct.fit_transform(X_, t))
best_penalty,best_score = choose_regularization(X_encoded,t)
print("Best penalty : {}".format(best_penalty))
We can see that l2 (i.e., ridge) gives us the best CV score.
best_penalty = 'l2'
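SGDRegressor with penalty='l2' optimizes the same ridge objective, just with stochastic gradient descent instead of a closed-form solver. A sanity-check sketch on synthetic data (not part of the original pipeline; the data and settings here are our own):

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

rng = np.random.RandomState(1)
X = rng.randn(200, 3)                                  # standardized-ish features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.randn(200)

ridge = Ridge(alpha=0.0001).fit(X, y)
sgd = SGDRegressor(penalty='l2', alpha=0.0001, max_iter=5000,
                   tol=1e-6, random_state=1).fit(X, y)
print(ridge.coef_, sgd.coef_)  # coefficients should be close
```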
Now we will do the same for the regularization strength (alpha) of the SGD.
def choose_best_alpha(x, t):
    scores = pd.DataFrame(columns=["alpha", "val_score", "val_loss", "train_score", "train_loss"])
    alpha = 0.0001
    for i in tqdm(range(100), desc='Alpha'):
        selector = SGDRegressor(penalty=best_penalty, alpha=alpha, random_state=1, eta0=best_lr, learning_rate="constant").fit(x, t)
        mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=10, show_score_loss_graphs=False, use_pbar=False)
        if mean_val_score < 0:
            break
        scores.loc[len(scores)] = [alpha, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]
        alpha += 0.0001
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=scores["alpha"], y=scores["val_score"]))
    fig.update_xaxes(title_text="alpha")
    fig.update_yaxes(title_text="Cross-validation score (R²)")
    fig.show()
    max_val = scores["val_score"].max()
    best_alpha = scores[scores["val_score"] == max_val]["alpha"].to_numpy()[0]
    return best_alpha, max_val
X_ = X.loc[:,best_features]
numerical_cols = X_.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct = ColumnTransformer([
("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
("standard", StandardScaler(), numerical_cols)])
X_encoded = pd.DataFrame(ct.fit_transform(X_, t))
best_alpha,best_score = choose_best_alpha(X_encoded,t)
print("Best alpha : {}".format(best_alpha))
Now we will use the result of our research to build a model.
print("Number of features = {}, Degree = {}, Penalty = {}, Learning rate = {}, alpha = {}".format(len(X.loc[:, best_features].keys()),best_degree,best_penalty,best_lr, best_alpha))
features = X.loc[:,best_features].keys()
X_train = X[features]
display(X_train)
ct = ColumnTransformer([
("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
("standard", make_pipeline(PolynomialFeatures(degree=2), StandardScaler()), numerical_cols)])
model_pipe = make_pipeline(ct, SGDRegressor(penalty=best_penalty, random_state=1, alpha=best_alpha, eta0=best_lr, learning_rate="constant"))
model_pipe.fit(X_train,t)
y_train = model_pipe.predict(X_train)
rmse = mean_squared_error(t, y_train, squared=False)
print("Final RMSE: {}".format(rmse))
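Note that `mean_squared_error(..., squared=False)` is deprecated in scikit-learn >= 1.4 in favor of `root_mean_squared_error`. A version-agnostic sketch of the same RMSE computation (the helper name is our own):

```python
import numpy as np

def rmse(t, y):
    """Root mean squared error, computed directly with NumPy."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((t - y) ** 2)))

print(rmse([1, 2, 3], [1, 2, 5]))  # → sqrt(4/3) ≈ 1.1547
```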
X_test = test_df[features]
X_test
X_test.info()
y_test = model_pipe.predict(X_test)
submission = pd.DataFrame({
"Id": test_id,
"SalePrice": y_test
})
submission.to_csv('submission5.csv', index=False)
In this assignment we were asked to predict house prices in Ames, Iowa.
Our data comprises about 80 features, some of them related to each other and some not, so data exploration and research were important. Because of the number of features, a lot of the work went into finding out which features have the greatest potential to predict the target value.
To achieve that goal we used correlation methods and graphs that revealed the relationships between the features and the target.
In addition, we used cross-validation (K-Fold) to find the best polynomial degree, regularization method, and hyper-parameters such as alpha and the learning rate, all of which helped train the model in the best way.
Insights: